This data set covers the chemical makeup and quality test scores of red wine. I am setting out to find which chemical attributes show a relationship to taste test scores, and how strongly the attributes correlate with one another.
Citation #1 http://rpubs.com/Daria/57835 Red and White Wine Quality by Daria Alekseeva
Citation #2 This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
ABOUT THE DATA SET
For more information about this data set, please see wineQualityinfo.txt.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##  X                fixed.acidity   volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200     Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200     Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278     Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800     Max.   :1.000
##  residual.sugar   chlorides         free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  6.00         Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00         1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00         Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47         Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00         3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00         Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##  alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
There are 1599 observations and 13 variables in this set. The variable names are X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality.
We can imagine that pH, the acidities, and alcohol might be naturally related. X looks to be just a row identifier and is not needed. The strongest alcohol content is 14.9 percent and the lowest is 8.4 percent. The median quality score is 6, with a mean of 5.636 and a max of 8. The lowest quality score was a 3. Yuck!
I would also bet that sulfur dioxide is negatively related to quality score, since as far as I know sulfites are often found in lower quality wines.
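Dropping the identifier column can be sketched as follows. This is a minimal illustration, not the original analysis code: the data frame name `wine` is my own choice, and it holds only the six rows printed by the head output above, with two of the measurement columns kept for brevity.

```r
# First six rows as printed in the head() output above
# (only two measurement columns kept for brevity; `wine` is a hypothetical name).
wine <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# X simply repeats the row number, so it carries no chemical information.
stopifnot(identical(wine$X, seq_len(nrow(wine))))

# Remove the column before computing correlations or plotting.
wine$X <- NULL
names(wine)
```

Removing X early keeps it out of any correlation matrix, where a row counter would only add a meaningless extra row and column.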
In the following twelve graphs, I am taking a first look at the variables. I will mostly be looking for abnormal distributions and outliers.
The distributions that seemed abnormal were: Alcohol, Residual Sugar, Free Sulfur Dioxide, Sulphates, Volatile Acidity, Citric Acid, Chlorides, and Total Sulfur Dioxide. The most common issue was a right skew (values piled up on the left with a long tail to the right, so the mean sits above the median), followed by long tails, some with extreme outliers.
I used ggpairs to analyze the correlations and distributions of the variables. Scatter plots varied, but for the main variable I am interested in, “Quality”, the points form stripes because scores were whole numbers: 3, 4, 5, 6, 7, or 8.
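A rough numeric check of skew can be made from the summary output alone: when the mean sits clearly above the median, the distribution has a long tail toward high values. The sketch below copies medians and means from the summary table above. The 5% cutoff and the column names `rel_gap` and `long_high_tail` are arbitrary choices of mine, and pH is included only as a roughly symmetric comparison.

```r
# Median and mean copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "sulphates", "pH"),
  median   = c(2.200, 0.07900, 14.00, 38.00, 0.6200, 3.310),
  mean     = c(2.539, 0.08747, 15.87, 46.47, 0.6581, 3.311)
)

# Relative gap between mean and median; a clearly positive gap hints at
# a long tail toward high values.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# 0.05 is an arbitrary threshold chosen for illustration only.
stats$long_high_tail <- stats$rel_gap > 0.05

stats[order(-stats$rel_gap), ]
```

This is only a screening heuristic. The histograms remain the better guide to outliers and to whether a log transform might help.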
The most strongly correlated pairs of variables were:
1. Fixed.acidity and pH, with a correlation coefficient of -.68
2. Density and Fixed.acidity, with a correlation coefficient of .68
3. Total sulfur dioxide and free sulfur dioxide, with a correlation coefficient of .67, though I believe free sulfur dioxide may be a subset of total sulfur dioxide
The variables most strongly correlated with Quality were:
1. Alcohol, with a correlation coefficient of .48
2. Sulphates, with a correlation coefficient of .25
3. Citric.Acid, with a correlation coefficient of .23
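A ranking like this can also be produced without ggpairs, using base R's `cor()`, which computes Pearson coefficients by default. The sketch below is illustrative only: it uses the six rows printed by the head output above, so its numbers will not match the full-data coefficients quoted in this report, and the names `wine`, `predictors`, and `r_quality` are my own.

```r
# Six rows from the head() output above; far too few for reliable
# coefficients, so this shows the mechanics only.
wine <- data.frame(
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality     = c(5, 5, 5, 6, 5, 5)
)

# Correlate every predictor with quality (Pearson by default).
predictors <- setdiff(names(wine), "quality")
r_quality  <- sapply(wine[predictors], function(x) cor(x, wine$quality))

# Rank from most positive to most negative.
sort(r_quality, decreasing = TRUE)
```

Run on the full 1599 rows, the same call would give one coefficient per chemical variable, which is an easier way to read off the ranking than scanning the ggpairs grid.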
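Wow, who knew that alcohol content alone could track taste test results this closely? Note that a correlation coefficient of .48 does not mean 48% of the score is explained: in a simple linear fit, the share of variance explained is the square of the coefficient, or roughly 23%.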
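When reading the alcohol coefficient, keep in mind that a correlation coefficient is not itself a percentage of the result explained. Assuming a simple linear relationship with one predictor, the share of variance explained is the coefficient squared, which can be checked directly:

```r
# Correlation between alcohol and quality, as reported above.
r <- 0.48

# Under a one-predictor linear model, variance explained is r squared.
r_squared <- r^2

# Expressed as a whole percentage.
round(100 * r_squared)  # 23
```

So alcohol on its own accounts for a bit under a quarter of the variation in quality scores, which is still notable for a single variable.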
Treating quality as a categorical variable allows us to see some really interesting plots. We can see beyond the .48 correlation coefficient and say that the mean alcohol levels for 7 and 8 wines were much higher than the rest, but not always. However, 3 and 4 quality wines had a higher mean alcohol than 5 wines, which had a very tight interquartile range compared to the other categories and the lowest mean. Finally, citric acid showed some of the most dramatic differences in means from one category to the next, although its usefulness for prediction was less evident, as we saw more crossing and convergence in the quality trend lines.
The above plot shows the various wine alcohol contents for our sample across the range of quality scores. There is a clear positive correlation, though not a very strong one. A 7 or 8 quality wine rarely has an alcohol content under 10 percent. I used an alpha of 1/8 to reduce overplotting.
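Treating quality as categorical and comparing group means can be sketched in base R with `factor()` and `tapply()`. The values below are the first ten alcohol and quality entries shown in the str output above, so the group means here are illustrative and are not the full-data means discussed in this report. The names `quality_f` and `mean_alcohol` are my own.

```r
# First ten values of each column, as shown in the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declare the full 3 to 8 scale as an ordered factor so every score
# gets a slot, even if it does not occur in these ten rows.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Mean alcohol per quality score; scores absent from the sample give NA.
mean_alcohol <- tapply(alcohol, quality_f, mean)
mean_alcohol
```

The same factor can be passed to a plotting function as the grouping variable, which is what produces one box per quality score instead of a continuous axis.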
Repeating the box plot for the third most related variable, citric.acid, shows, surprisingly, that wines with quality scores of 7 and 8 have vastly higher amounts of citric acid. If you are trying to pick a top quality wine, 7 and up, this variable is about as important as alcohol, even though the overall correlation coefficients were very different when considering all scores. If I were to analyze this further for use in a prediction algorithm, I might bin citric acid into 3 groups (low, middle, high), since the means for the 3 & 4, 5 & 6, and 7 & 8 wines are very similar to each other. There are some outliers, and most are in the 7 category.
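The three-group idea for citric acid could be sketched with base R's `cut()`. The cut points are an assumption on my part: I use the first and third quartiles from the summary output above (0.09 and 0.42), since no thresholds were chosen in this analysis. The sample values are the first ten citric.acid entries from the str output above.

```r
# First ten citric.acid values, as shown in the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Hypothetical cut points: 1st and 3rd quartiles from the summary() output.
# Intervals are right-closed, so 0.09 itself would fall in "low".
citric_group <- cut(
  citric,
  breaks = c(-Inf, 0.09, 0.42, Inf),
  labels = c("low", "middle", "high")
)

table(citric_group)
```

Cut points taken from the group means for 3 & 4, 5 & 6, and 7 & 8 wines might separate the quality bands better than quartiles do, which would be worth testing on the full data.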
This graph illustrates the relationship of quality to its two most correlated variables: alcohol and sulphates. It appears that to earn a quality score of 8, a wine must sit in the upper right quadrant and also have high citric acid levels. The 8 wines almost always have an alcohol content above 11, citric acid of at least .2, and sulphates of at least .6. The 3's and 4's are the most diverse, but 5's typically have a much higher alcohol content, as high as or higher than most 8's, while lacking in the other factors such as sulphates and citric acid (not shown above).
I was surprised that you could take a subjective score, albeit from trained testers, and have multiple facets of the objective data correlate with it. When I used ggpairs, I saw that about half of the variables had fairly normal distributions. On the other hand, many were right-skewed as well. I think it is extremely important to make sure you know whether the data is categorical or continuous.
This data set's sample size is not huge, and it contains one wine from one country. Therefore, we would need more data to create a meaningful prediction model. However, it does show promise for predicting wine quality based on chemical factors, given a more expansive sample.
When it comes to quality taste score, the data shows that alcohol plays a clear role; however, the significance of the sulphates and citric acid levels is debatable unless you apply it to specific quality categories. One of the issues I encountered was that quality score was actually a categorical value.
So, eventually, I converted the data to categorical. However, after the last review, I was informed that ggplot could treat the values as categorical, which solves the problem. Additionally, some of the variables seemed to be highly related, especially the following 3:
1. Fixed.acidity and pH, with a correlation coefficient of -.68 (acidity seems to be part of the equation for pH)
2. Density and Fixed.acidity, with a correlation coefficient of .68 (acid appears to be denser than the rest of the chemical makeup)
3. Total sulfur dioxide and free sulfur dioxide, with a correlation coefficient of .67 (I think free sulfur dioxide is a subset of total sulfur dioxide)
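The subset idea about sulfur dioxide can be sanity-checked directly: if free sulfur dioxide is part of the total, it should never exceed the total in any row. The sketch below uses the first ten values of each column from the str output above. A pass on ten rows is consistent with the subset reading but does not prove it for the full data.

```r
# First ten values of each column, as shown in the str() output above.
free  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# If free SO2 is a component of total SO2, this should be TRUE for every row.
all(free <= total)

# Share of the total that is free, per wine.
round(free / total, 2)
```

If the check holds on all 1599 rows, part of the .67 correlation is structural, since one measurement is contained in the other.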
For future studies, I would recommend analyzing price and grouping categories 3-5 into a low quality wine category to reduce noise and increase the correlation for those wines of a 6 and up quality score.
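The grouping suggested here could be sketched with `cut()`. The label "high" for scores of 6 and up is my own naming, as is the variable `quality_band`. The sample is the first ten quality scores from the str output above.

```r
# First ten quality scores, as shown in the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Scores 3 to 5 become "low"; 6 to 8 become "high" (hypothetical label).
# Intervals are right-closed: (2, 5] and (5, 8].
quality_band <- cut(quality, breaks = c(2, 5, 8), labels = c("low", "high"))

table(quality_band)
```

Correlations could then be recomputed within the "high" band alone to see whether alcohol, sulphates, and citric acid separate 6, 7, and 8 wines more cleanly.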